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Abstract — We propose a Bayesian image super-resolution (SR) 
method with a causal Gaussian Markov random field (MRF) 
prior. SR is a technique to estimate a spatially high-resolution 
image from given multiple low-resolution images. An MRF model 
with the line process supplies a preferable prior for natural 
images with edges. We improve the existing image transformation 
model, the compound MRF model, and its hyperparameter 
prior model. We also derive the optimal estimator - not the 
joint maximum a posteriori (MAP) or marginalized maximum 
likelihood (ML), but the posterior mean (PM) - from the objective 
function of the L2-norm (mean square error) -based peak signal- 
to-noise ratio (PSNR). Point estimates such as MAP and ML 
are generally not stable in ill-posed high-dimensional problems 
because of overfitting, while PM is a stable estimator because all 
the parameters in the model are evaluated as distributions. The 
estimator is numerically determined by using variational Bayes. 
Variational Bayes is a widely used method that approximately 
determines a complicated posterior distribution, but it is gener- 
ally hard to use because it needs the conjugate prior. We solve 
this problem with simple Taylor approximations. Experimental 
results have shown that the proposed method is more accurate 
or comparable to existing methods. 

Index Terms — super-resolution, Bayesian inference, Markov 
random field prior, line process, posterior mean, variational 
Bayes, Taylor approximation. 

L Introduction 

Super-resolution (SR) is an information processing tech- 
nique that makes it possible to infer a spatially high-resolution 
(HR) image of a scene from corresponding multiple low- 
resolution (LR) images that are affected by warping, blurring, 
and noise. SR can be applied to a variety of images; e.g., 
still images extracted from several sequential video frames. 
SR needs the registration of LR images in addition to the 
image restoration of the registered LR images. Since the 
earliest work by Tsai and Huang (B, SR has been achieved 
using various methods f2l-|T^ and good overviews of these 
methods are given in [11J-|16|. Generally, SR is an ill-posed 
inverse problem because inverting the blur process without 
amplifying the effect of the noise is difficult |13|. In other 
words, the degrees of freedom of the HR image and pixel- wise 
observation noise are always higher than the dimensionality 
of the observed LR images, so complete determination of 
an HR image is impossible. Therefore, the HR image is 
frequently inferred as the most preferable image within the 
framework of the probabilistic information processing, and we 
handle SR using this framework in this paper. The probabilistic 
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information processing has three key features: 1) model, 2) 
objective function, and 3) optimization method. In the SR 
problem, the model includes the observation model and the 
prior model. The observation model consists of warping, 
blurring, downsampling, and noise models. The prior model, 
necessary for the Bayesian framework, mainly consists of an 
HR image prior, and sometimes includes both the hyperparam- 
eter prior for the HR image prior and the registration prior. The 
objective function evaluates how good or bad an estimator is. 
The estimator usually represents the inferred HR image, and 
sometimes includes auxiliary parameters; e.g., the registration 
parameters and edge information. The optimization method 
numerically maximizes/minimizes the objective function and 
determines the estimator. An optimization method is not neces- 
sary for simple problems in which an analytical exact solution 
can be obtained. In the probabilistic information processing, 
SR can be categorized according to these three key features. 

To deal with warping, blurring, and downsampling, a linear 
transformation model is frequently used |3|, |6|, [fl, ifTOll . 
Warping is usually limited with planar rotation and parallel 
translation. Blurring is defined by using a point spread function 
(PSF); a square or Gaussian type PSF is common. Downsam- 
pling denotes sampling from an HR image to construct an LR 
image. Downsampling sometimes includes anti-aliasing. Since 
these three transformations are linear, they can be combined 
into a single transformation matrix. As for the noise model, 
pixel-independent additive white Gaussian noise (AWGN) is 
usually employed. 

The Bayesian framework, especially the HR image prior, is 
quite useful for SR. The HR image prior provides appropriate 
smoothness between neighboring pixel luminances. A com- 
mon type of HR image prior imposes an L2-norm penalty 
on differences between horizontally and vertically adjacent 
pixel luminances (the first derivative). The LI -norm of the 
first derivative is sometimes used, and it has the advantage 
of robust inference against outliers. The total variation (TV) 
prior 1 10] employs the LI -norm of the gradient vector. The 
Huber prior Q is a mixture prior of LI- and L2-norms. 
The SAR model ||2l, ||9l, IITtI employs the response of a 
two-dimensional Laplacian filter (the second derivative). The 
Gaussian process prior |3| has neighboring pixels spread 
according to a Gaussian distribution. Besides the degree of 
smoothness between neighboring pixels, information regarding 
the discontinuity, or equivalently, the edges or line process, 
is also useful for inference. A common type of prior imple- 
menting edges is the compound Markov random field (MRF) 
prior that was introduced by Geman & Geman ifTSl and is 
widely used 141, m, |0. With respect to the compound MRF 
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O, 1201 prior, the normalizing constant, or equivalently, 
the partition function, is usually difficult to calculate because 
it has an exponential calculation cost with respect to the 
dimensionality of the line process. Recently, Kanemura et al. 
(61, fS] confusingly introduced a "causal" type of Gaussian 
MRF prior whose calculation cost is polynomial. We try to 
improve this prior in this paper. 

The SR estimator should be derived from an objective 
function. As the objective function, a posterior distribution has 
been widely employed. Since the posterior distribution usually 
includes both the HR image and registration parameters, the 
joint maximum a posteriori (MAP) solution [2J is a suitable 
estimator for this objective function. Other than the joint MAP, 
the use of the marginalized maximum likelihood (ML) |3J, 01 
or marginalized MAP |5| has been proposed. Tipping et al. 
O and Kanemura et al. ||6l, O determine the registration 
parameters by using ML inference, where the HR image is 
marginalized out, and determine the HR image by using MAP 
inference. Pickup et al. determines the HR image by using 
MAP inference, wherein the registration uncertainties are 
marginalized out, and assumes that the registration parameters 
are pre-registered by using standard registration techniques. 
Marginalized ML is also called type-II ML, evidence ap- 
proximation, or empirical Bayes. Marginalized ML has no 
registration prior, unlike marginalized MAP. Pickup et al. 
(S i reported that marginalized MAP is superior to both joint 
MAP and marginalized ML. We evaluate the accuracy of 
SR methods in terms of the L2-norm (mean square error) - 
based peak signal-to-noise ratio (PSNR). Therefore, we think 
it is natural to employ PSNR as the objective function. For 
this objective function, posterior mean (PM) is a suitable 
estimator. The variational Bayes [21] approach fTO] seems to 
approximately determine the PM of the HR image, although 
the authors assume some registration parameters are known 
and use point-estimate model parameters obtained by ML 
inference. To determine the exact PM of the HR image, all 
parameters other than the HR image should be marginalized 
out over the joint posterior distribution. 

The type of optimization method to use is not as substantial 
a problem as the choice of model and objective function, but 
it is still important. Since almost all good estimators cannot be 
exactly determined because of difficult analytical integration or 
an exponential calculation cost, some approximation methods 
need to be introduced. Also, parameter tuning is necessary in 
many numerical optimization methods; e.g., of the initial value 
and the step- width settings in gradient methods. Specifically, 
in early work done on image restoration, an annealing method 
was used for the joint MAP solution ifTSl, ||22 l. For marginal- 
ized ML and marginalized MAP solutions, the scaled conju- 
gate gradients algorithm was used |3|, |5|. In recent work, 
the variational expectation-maximization (EM) algorithm has 
been applied, which includes the gradient method in the M 
step [6J, ISJ. The variational Bayes approach has also been 
applied ifTOll . This method includes nested optimization of 
the majorization-minimization approach. This majorization- 
minimization approach seems to affect both the HR image 
prior and the estimator. Specifically, it modifies the TV prior to 
include a discontinuity parameter (called local spatial activity). 



In addition, this parameter is point-estimated when the HR 
image is inferred. 

In this paper, we propose a new SR method that employs 
a "causal" Gaussian MRF prior and utilizes variational Bayes 
to calculate the optimal estimator, PM, with respect to the 
objective function of the L2-norm-based PSNR. This is a 
straightforward approach, but it was not proposed earlier 
possibly because an important limitation of variational Bayes 
is that a conjugate prior is needed. We solve this problem 
through simple Taylor approximations. In Section II, we define 
models, where we introduce a novel unified warping, blurring 
and downsampling model, an improved HR image prior, an 
improved hyperparameter prior, and a registration prior. In 
Section III, we employ PSNR as the objective function and 
derive the optimal estimator, PM, from this objective function. 
In Section IV, we determine the PM by using variational Bayes 
and Taylor approximations. In Section V, we evaluate the 
proposed method by comparing it with existing methods. We 
discuss the proposed method in Section VI and conclude in 
Section VII. 

II. Model 

A. Definitions 

First, we define the gamma, Bernoulli, and Gaussian distri- 
butions used in this paper: 

Gamma(x; a, h) = -— — ^''"-^e"^'^ {x > 0), 
I [a) 

Bernoulli(x; /i) = /i^(l - /i)^"^ {x e {0, 1}), 

Here, F is the gamma function, | • | denotes the determinant 
of a given matrix, superscript T denotes the transpose, M 
is the real number field, and d is the dimension of x. The 
logistic function and Kullback-Leibler (KL) divergence from 
distributions p{x) to q{x) are respectively defined as 

logistic(x) = . / 

1 + e ^ 

DKL{p{x)\\q{x)) ^ (\n^) , 

where the angle brackets (•)o denote the expectation of • with 
respect to a distribution o. Additionally, tr denotes the trace of 
a given matrix, diag denotes a diagonal matrix. I is an identity 
matrix of appropriate size. is a zero vector or a zero matrix 
of appropriate size. All the vectors in this paper are column 
vectors. The || • ||2 denotes the L2-norm of a given vector. At 
this point, these variables have absolutely nothing to do with 
the variables that appear later. 

B. Observation Model 

Our task is to estimate an HR grayscale image, x G M^'^, 
from the observed multiple LR grayscale images, Y = 
{yi}iLi^yi ^ M^^. Images yi and x are regarded as lexi- 
cographically stacked vectors. The number of pixels for each 
LR image, A^^, is assumed to be less than that of the HR 
image, Nx', i.e., Ny < Nx- We do this estimation using 
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Warped Blurred Downsampled Noise added 



Fig. 1. An illustration of the image observation process 



an SR technique whose resolution enhancement factor is 
a = y^NxJ^ (> !)• Although we define the range of a 
pixel luminance value as infinite, we use —1 for black, +1 for 
white, and values between —1 and +1 for gradual gray. 

The image observation process is modeled as shown in 
Fig. [11 the HR image x is geometrically warped, blurred, 
downsampled, and corrupted by noise e/ to form the observed 
LR image yr. 



yi W{(l)i)x^ei, 



(1) 



or, more strictly, 

L 

p{Y\x, /3, = llMyi'. W{ct^i)x, r^I). (2) 



^=1 



The ei G M ^ is AWGN with precision (inverse variance) 
P (> 0). Here, W{(j)i) is the Ny x Nx transformation 
matrix that is simultaneously used for warping, blurring, and 
downsampling. It is defined as 



A/'(x(^(,o,,Ci,ei);0,7r'-f) 



cos sm t 
- sin 6 cos ( 



{aC- oj 



(3) 



(4) 



where X represents the extent of the summation (explained 
in the next paragraph), and the vectors and Q respectively 
denote the two-dimensional positions of the i-th pixel of the 
original HR image and the j-th pixel of the observed LR 
image. We define the center of each image as the origin 
and the size of each pixel is 1 by 1. For example, regard- 
ing an HR image with 40 x 40 pixels, each ^ represents 
[-19.5, -19.5]^, [-18.5, -19.5]^, [19.5, 19.5]"^. Oi and oi 
represent the warping parameters of the l-th LR image: the ro- 
tational motion parameter and translational motion parameter. 
The Gaussian distribution in ^ represents a Gaussian PSF 
that defines the blur, and 7/ (> 0) represents its precision 
parameter. In this paper, we assume also differs for each 
observed image. These transformation parameters are packed 
into (f)i, which is defined as 



[^, 



[Oi, [oi]h, [oi]v,7i] 



(5) 



where subscripts h and v, respectively, denote horizontal and 
vertical positions on the image. 

In previous works 0, 0), El, the extent of I was defined 
as the extent of the HR image. According to this definition, 
however, the shape of the PSF is no longer Gaussian. For 
example, at the corner of the HR image, the shape is not 
omnidirectional but limited in a way such as that of a quadrant. 
In this paper, the extent of X is defined as infinite, and the 
luminance values outside the HR image are defined as 



(middle gray). This normalization term faithfully represents 
the Gaussian PSF. We also found that this normalization term 
is exactly given by using the elliptic theta function ^3, and 
we can rewrite W{(f)i) as 



x{Oh oi, Cj, 6) 



1^3(1^, g') = 1 + 2 ^ q"^^ cos 2niiu. 



Oh Cj, 6) 



(6) 



(7) 



The elliptic theta function includes an infinite series, but 
it is easily determined numerically because the convergence 
is quite fast. In Q, the normalization term (the denomina- 
tor of the right-hand side) seems to depend on i because 
x{^uOuCjXi) includes ^i, but this is not true. Because the 
elliptic theta function is a periodic function with respect to 
the argument u with period 1, and xi^u^uCj^i) can only 
take discrete values with step size 1 for the horizontal and 
vertical directions, the normalization term has the same value 
with respect to i. 

C. HR Image Prior 

Here, we introduce a "causal" Gaussian MRF prior for 
the HR image and additional latent variables. These la- 
tent variables are called the line process that controls 
the local correlation among pixel luminances. The intro- 
duction of the latent variables enables explicit expression 
of the possible discontinuity in the HR image. The line 
process, 77, consists of binary variables rji^j G {0, 1} 
for all adjacent pixel pairs i and j. Its size equals 
Njj = 2Nx — [number of HR image's horizontal pixels] — 
[number of HR image's vertical pixels]. We define the prior 
as 



p{x, 77! A, p, 1^) = p{x\r], p, i^)p{r]\X) 



(8) 



exp 



In 



27T 



Nrj In logistic (A) 



(9) 



where 



p{r]\X) = ]^ Bernoulli (r/i^j; logistic( A)) , (10) 



p{x\rj, p, = J\f{x] 0, A(r7, p, ^), (11) 

0, otherwise. 

Here, the summation ^^^j is taken over all pairs of adjacent 
pixels. The notation i ^ j means that the i-\h and j-th pixels 
are adjacent in the upward, downward, leftward, and rightward 
directions. The line process r] switches the local characteristics 
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of the prior. It indicates whether two adjacent pixels take 
similar values or independent values. When r]ij = 1, the 
i-th and the j-th pixels are strongly smoothed according to 
the quadratic penalty, whereas there is no smoothing when 
r]ij = 0. The hyperparameter A (> 0) is an edge penalty 
parameter that prevents r]ij from excessively taking edges. 
Note that A is restricted to positive values because a negative 
A leads to a reward rather than a penalty for taking edges, p 
(> 0) is a smoothness parameter that prevents the differences 
in adjacent pixel luminances from becoming large, and k, 
(> 0) is a contrast parameter that prevents x from taking an 
improperly large absolute value. On the other hand, in previous 
works (SI, m, hz is assumed to be 0, which results in an 
improper normalizing constant (see Discussion). A{r], p, hz) is 
the Nx X Nx precision matrix of x. 

We have defined the introduced causal Gaussian MRF prior 
in the joint distribution form of x and 77, i.e., p{r])p{x\r]) . We 
call such a model "causal" because rj seems to cause x. The 
MRF model is defined as having the property 



p{Xi\x\Xi, 7]) = p{Xi\Xc{i) , ViXii) ) 



(13) 



in this case; i.e., the conditional distribution of a random 
variable, Xi, given all other variables, x\xi and 77, equals the 
conditional distribution of the random variable, Xi, given its 
"neighboring" variables, x^i^ and rji^^iy If this conditional 
distribution is a Gaussian distribution, such an MRF is called 
a Gaussian MRF. 

The "compound" MRF prior is usually defined in the form 
of the Gibbs distribution lITSl . 



p{x,rj) 



exp{-H{x,rj)) 



(14) 



which is based on some microstate energy function, or equiv- 
alently, a Hamiltonian, such as 

H{x,ri) 

^ A^(l-,?,,,) + ^Y.^,^,ix,-xjf + ^\\x\\l (15) 

In addition to the property of ([T3]) , a compound MRF also has 
the property of 



(16) 



whereas the introduced "causal" Gaussian MRF prior does not. 
Therefore, we do not call the introduced prior a "compound" 
MRF prior, even though ([8]) and (O have similar forms. 
Furthermore, the introduced "causal" Gaussian MRF prior is 
a generative model, whereas the "compound" MRF is not. A 
generative model has the advantage of reducing the calculation 
cost (see Discussion). 



D. Hyperparameter Prior 

Generally, prior distributions should be non-informative 
unless we have explicit reasons because an informative prior 
leads to heuristics. Actually, we define the prior distributions 



for the hyperparameters of the HR image prior to be as non- 
informative as possible: 

p(A, p, P) = Gamma(A; bf^) Gamma(p; 6^^) 

X Gamma(/T:; b^^) Gamma(/3; a|^\ b^^)^ 

(17) 

af ^ 10-^6f ^ 10-^ af ^ lO'^bf ^ 10-^ 

a(0) ^ 10-^ ^ 10-^ ^ 10'^ bf ^ lO''. (18) 

For a gamma distribution, the number of effective prior 
observations in the Bayesian framework is equal to two times 
parameter a. As shown in the Appendix, the number of 
observations for the hyperparameter A is Nrj in this SR. Also, 
that for p and hz is Nx, and that for /3 is LNy. Therefore, the 
above settings - e.g., 2af <C Njj - are considered sufficiently 
non-informative. Superscript (0) is added because we use these 
parameters as the initial values of variational Bayes later. 

E. Registration Prior 

For the registration parameters including the blurring pa- 
rameter, we also define the corresponding prior as 

L 



(19) 



1=1 



^ [0,0,0,12/ai, ^diag[10-^ 10^10^10-^]. 

(20) 

For the rotational moti on pa rameter 61, the prior assumes 
± 1.81 degree (^VlQ-^ - 1.81). This assumption is 
considered suitable for this SR task. Similarly, an assumption 
of ± 1 pixels for translational motion parameters [oi]h and 
[di]y is considered suitable. For blurring parameter 7/, p^] is 
taken to be the value equivalent to the anti-aliasing of the scale 
factor a. 

III. Objective Function and Estimator 

A. Peak Signal-to-Noise Ratio (PSNR) 

First, we confirm that the joint distribution of all random 
variables can now be explicitly given as 

p{Y, z) = p{Y\x, /3, ^)p{x, r/|A, p, n)p{\ p, f3)p{^), 

(21) 

2;= [x,r7,[A,p,/^,/3],*], (22) 

Once the joint distribution is obtained, we can derive all 
the marginal and conditional distributions; e.g., the posterior 
distribution p{z\Y) and joint distribution of the HR and LR 
images p{Y^ x). 

One of the most commonly used evaluation functions of 
the inferred image is the L2-norm (mean square error) -based 
PSNR. It is defined as 

2^ 

PSNR(;^;a^) = lOlogio ^, (23) 

where x is the estimator of the HR image and x is the true 
HR image. Since only LR images, Y, are available for the 
estimator, we sometimes explicitly express it as a function 
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form, x{Y). Now, our objective function (functional) to be 
maximized regarding the estimator is defined as 

FixiY)) = 10 logio ^—-T • (24) 

This is because we prefer good estimator performance on 
average over various HR images and the corresponding LR 
images. Here, we assume that the occurrence rate of HR 
and LR images exactly coincides with the model we just 
introduced. 



B. Posterior Mean (PM) 

Using the above objective function, we can explicitly derive 
the best estimator of the HR image as the PM, 



argmaxF(x(r)) = {x) .^y) • 



(25) 



Here, we used the well-known fact that the PM coincides 
with the minimum mean square error estimator in Bayesian 
framework. Note that p{x\Y) needs marginalization of all 
parameters other than x over p{z\Y). If the PM of the line 
process or other model parameters is necessary, it can also be 
determined in the same manner. 

IV. Optimization Method 
A. Variational Bayes 

Though we could derive the optimal estimator, we cannot 
obtain the analytical solutions of the posterior distribution 
p{z\Y) and marginalized posterior distribution p{x\Y). Con- 
sequently, we have to rely on approximations. Here, we 
employ variational Bayes. 

Variational Bayes flT\ provides a trial distribution q{z) that 
approximates the true posterior. We impose a factorization 
assumption on the trial distribution. 



q{z) = q{x)q(r])q{\ p, P)q{^). 



(26) 



Note that, at this moment, the distribution family of each 
factorized distribution is not limited. We identify the optimal 
trial distribution that minimizes the KL divergence between 
the trial and the true distributions as the best approximation 
of the true distribution: 

q{z) = 8.TgmmDKL{q{z)\\p{z\Y)). (27) 

Actually, the trial distribution that minimizes the KL diver- 
gence, not from q{z) to p{z\Y) but from p{z\Y) to q{z) 
coincides with the product of the exact marginal distributions 
as 



argmin Dkl {p{z\Y)\\q{z)) 



IIp{z,\Y), (28) 



but this minimization is difficult to calculate. 

Under the factorization assumption of the trial distribution 
and the extremal condition of the KL divergence, each optimal 
trial distribution should satisfy the self-consistent equations. 



q{zi) (X exp(lnp(z|y))n . 



(29) 



In the common style of variational Bayes ifTOl . ll23l . this 
equation is solved by making repetitive updates. 



q^'\z,)^p{z,), 
q^'-^^\zi) (X exp{\np{z\Y))Yi^^^qi^^^y 



(30) 
(31) 



Each factorized trial distribution is supposed to converge to 
the optimal distribution. Sometimes, some q^^'^^\zj)s are used 
instead of q^\zj)s for the distribution on the right-hand side 
of (|3T1) . It depends on the hierarchical structure of the model. 
Similarly, some q^^\zi)s may not be necessary. 



B. Taylor Approximations 

Although variational Bayes is a widely used general frame- 
work, its application is difficult in practice because it requires 
a conjugate prior. The prior distributions we have introduced 
are not conjugate priors. However, we have found that simple 
Taylor approximations make them conjugate and enable the 
analytical exact expectations in (|3T1) . 

Here, to simplify the notation, we define the mean values 
of the latent variables ry, the hyper parameters A, p, /3, and 
the registration parameters (f)i over the trial distributions in the 

step number t of the updates of variational Bayes as fifj , fif, 

(t) (t) (t) (t) 

M^, 

Specifically, we use first-order Taylor approximations for 
three non-linear terms. W{(l)i) is approximated around (f)i = 



f(t) 

l,k ' 



k=l 



where 



(t) 



dW{ct>i) 



l,k 



P>Lk 



Similarly, In | A(?7, p, is approximated 

[r],\np,\nK] = [/l^®, In /if, In /if]. 



(32) 

(33) 
(34) 

around 



ln\A{rj,p,K)\c^ln A{^^l ^^f , ^^f) 

+ tr A(/x©,M®,Mf [f^fAiv - t^f>, 1,0) 
+ {\np- \n^^f)^^fA{^lf|,l,0) + (In/t - In/xf 



(35) 



We also use a similar approximation around [rj, In p, In k] = 
In pf ,\n In addition, Inlogistic(A) is approxi- 
mated around \n\ = \npf. 



Inlogistic(A) ~ lnlogistic(/Li®) 
+ (lnA-ln/x®)//®logistic(-/x®). 



(36) 



6 



C. Update Equations 

The trial distributions are obtained from (I30l)-(l32l), (l35l) , and 
(l36l) , as follows: 

9® (^7) = n Bernoulli(r,i,,- ; /x® . ), (37) 

q®(aj)=M^;M|,S|), (38) 

(A, K, 0) = Gamma(A; of , hf) Gamma(p; o® , hf) 

X Gamma(«;; af,bf) Gamma(/3; a®, 6®), 

(39) 



(40) 



^=1 



For (|3Q1) and (|3T1) , we update those distributions as follows. 
First, we compute q^^'^^\r]) using (ic, A, p, a^, /3, Sec- 
ond, we compute q^^'^^\x) using q^'^^ {'q)q^ {\^ P^^). 
Finally, we compute q^'^^ {\^ P) using q^'^^ {x^r])q^ {^) 
and using g^^+-^^(ic, r])q^\X^ p, Here, we simply 

compute only the parameters of those distributions because 
we can compute the expectations in (|3T1) analytically by using 
Taylor approximations in (l32l) , (|35]) , and (l36l) . Specific update 
equations are described in the Appendix. 

For the initial parameters of the trial distributions of rj and 
X, we use non-informative values. 



0, 



0, E 



0. 



(41) 



For the initial parameters for A, p, k,, (3 and we use the 
same values as their prior's values. 

We obtain the well- approximated PM of x as fjL^\ Real- 
istically, instead of /l^I^^ we use fi^^^ when the following 
convergence conditions hold for /i^^-' and each p^^^^, 



^iiMr^-Miii^<io- 



^=1 



<10-'^ (A: = 1,2,3,4), (42) 



where we defined o-^ = [10 ^, 10^, 10^, 10 ^] as the scaling 
constant. 

V. Experimental Results 

The proposed method was evaluated using five gray-scale 
images with a size of 40 x 40 pixels, as shown in Fig.O From 
each image, L = 10 images with a size of 10 x 10 pixels were 
created by using ([T]), Q with the settings of the parameters 
a, and (3 as the following. The resolution enhancement 
factor a was 4. The transformation parameter $ was randomly 
created according to the prior distribution in ([T9l) . The noise 
level parameter /3 was set for signal-to-noise ratios (SNR) of 
20, 25, and 30 dB for each image. Samples of the created 
images are shown in Fig. [S] 

Figure |4] shows the images estimated under SNR= 30dB. 
The resolution of each image appeared to be better than the 
corresponding observed image in Fig. [S] 

Table HI lists the quantitative results compared to those from 
the methods of bilinear interpolation, Kanemura et al. @, and 



Babacan et al. ifTOli . Note that we added a slight modification to 
these methods because they employ slightly different models. 
For example, the original method |10| assumes the blurring 
parameter 7 is known, so we set 7 as the mean value of the true 
distribution for this method. Also, we introduced a strong prior 
for A in the Kanemura method [6] in contrast to the original 
method, because this parameter sometimes becomes negative. 
We evaluated the results with regard to the expectation and 
the standard deviation of the improvement in signal-to-noise 
ratio (ISNR) over 10 experiments on each image and for each 
SNR. ISNR is the relative PSNR defined as 



ISNR = PSNR(;^; x) - PSNR(;r; x), 



(43) 



where x is the true HR image, x is the image estimated by 
the proposed method, and x is the image estimated by the 
compared method. A higher ISNR value means better im- 
provement of the estimate against the estimate of the compared 
method. We see that the ISNRs of the proposed method were 
mostly higher than those of the other methods, except for the 
comparison with the Babacan 's method in Pepper image. 

Table HI] lists the root mean square errors (RMSE) of our 
method and the other methods. To evaluate the estimated reg- 
istration parameters, we took the RMSEs over 50 experiments 
(10 experiments x 5 images) for each noise level. Of course, 
a lower RMSE value means a better estimate. We see that the 
RMSEs of the proposed method were mostly higher than those 
of the other methods. 

The calculation times of the proposed method was about 
10 minutes on an Intel Core i7 2600 processor. The proposed 
method was a little slower than the method of Babacan et al. 
1 101 and a little faster than the method of Kanemura et al. O. 

VI. Discussion 

With regard to the observation model, we used a linear trans- 
formation and AWGN. The use of the linear transformation 
model is advantageous since an arbitrary transformation matrix 
W{4>i) can be employed because of the Taylor approximation. 
The transformation matrix can be constructed by multiplying 
three matrices: the warping, blurring, and downsampling ma- 
trices 1 10|. A disadvantage of this is that sub-pixel errors might 
accumulate. We prefer matrix construction via a continuous 
function |3|. We improved the construction by introducing 
an elliptic theta function for the normalizing constant in Q. 
This normalizing constant provides fair pixel weights for both 
marginal and central areas of the HR image and faithfully 
represents the Gaussian PSF. 

With regard to the HR image prior, we used a causal 
type of prior, which was first introduced by Kanemura et 
al. |6l, fSl. The microstate energy function, or equivalently, 
the Hamiltonian, -based compound MRF prior of ([T4l) . offers 
the advantage of easy construction, but it usually has an 
exponential calculation cost, 0(2^^), for the normalizing 
constant or, equivalently, the partition function, and this is 
an obstacle to direct calculation of the PM solution. The 
MAP solution has been used in work elsewhere because it 
is not affected by the normalizing constant. In contrast, the 
introduced causal type of prior of ([5]) has only a polynomial 



(a) Lena (b) Cameraman 

Fig. 2. Five original images used in the experiments 



(c) Pepper 



(d) Clock 



(e) Text 




(a) Lena (b) Cameraman (c) Pepper (d) Clock (e) Text 

Fig. 3. Observed images when warped, blurred, downsampled by an enhancement factor of 4, and noised with SNR= 30dB AWGN 




(a) Lena (b) Cameraman 

Fig. 4. Images estimated from Fig. [3] observed images 



(c) Pepper 



(d) Clock 



(e) Text 



calculation cost 0{N^), which enables us to successfully 
apply the variational Bayes method to this problem. 

With regard to the hyperparameter priors, we also im- 
proved the existing method. As the edge penalty parameter 
A, Kanemura et al |6| implicitly assumed A G M, which leads 
to a negative A and consequently results in an edge- strewn 
image. We assumed A > by setting its prior according to 
a gamma distribution, resulting in an appropriate inference. 
As the smoothness parameter p, they practically fixed the 
value of p with a strongly informative prior. We chose a non- 
informative prior for p. We show the box and whisker plot of 
the PM for each hyperparameter over 10 experiments on each 
image under SNR= 30dB noise in Fig. [5] As can be seen, 
the inferred value of the PM of p showed wide variation, 
with an approximately 10-fold maximum-to-minimum ratio, 
depending on the original image. This result can be interpreted 
as meaning it is worth inferring p in each HR image. Further- 
more, A and k, respectively showed approximately 2-fold and 
4-fold ranges of variation. Regarding the contrast parameter 
hz, they assumed hz = 0, which leads to |A| = 0, and this 
results in an improper normalizing constant. While we assume 
hz > 0, which leads to a proper normalizing constant, we can 
consequently take the term of In | A| into account in the update 
equations of the variational Bayes. 

With regard to the prior distribution for the blurring param- 



eter 7, we used a Gaussian distribution even though 7 is a 
positive real number. This is because we selected a simpler 
expression. We tried using the prior of the gamma distribution 
as 7, but the improvement was small. One disadvantage of 
this model is that a non-informative setting for this prior may 
lead to a nonsense result where the inferred 7 is negative. 
Moreover, we employed a somewhat informative prior for 
7. This is because the blurring parameter 7 and smoothness 
hyperparameter p are somewhat complementary. This means 
that simultaneous estimation of 7 and p is difficult. Tipping 
et al. (si and Kanemura et al. |6| fixed p, and Babacan et al. 
fixed 7. 

With regard to the estimator, we logically derived the 
optimal estimator PM from the objective function of the L2- 
norm-based PSNR. The widely used joint MAP estimator can 
be considered the optimal estimator for the all-or-none type 
objective function, 

argmax {S{z - z))^^^^^^ = argmaxp(2;|l^), (44) 

z z 

where S is the Dirac delta or Kronecker delta function. 
Generally, this type of objective function is nonsensical for 
continuous variables because it is measure zero. If all the 
random variables in the posterior distribution are discrete, or if 
we can assume some smoothness of the posterior distribution. 
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TABLE I 

PSNR OF THE PROPOSED METHOD (A HIGHER VALUE IS BETTER) AND ISNRS AGAINST THREE PREVIOUS METHODS (A HIGHER VALUE IS BETTER) FOR 

DIFFERENT IMAGES AND SNR LEVELS 



Image 


SNR 


PSNR 




ISNR 






[dB] 


(proposed) 


(vs bilinear) 


(vs Kanemura) 


(vs Babacan) 


Lena 


20 


29.22 ± 0.33 


±5.35 ± 0.33 


±0.73 ± 0.36 


-0.11 ± 0.08 




25 


30.69 ± 0.25 


±6.78 ± 0.30 


±1.04 ± 0.32 


±0.11 ± 0.11 




30 


62.16 ± U.Jo 


±8.18 ± (j.69 


±l.oU ± (J. 67 


±0.48 ± U.z4 


Cameraman 


20 


21.74 ± 0.19 


±4.11 ± 0.20 


±1.06 ± 0.36 


±0.05 ± 0.07 




25 


22.68 ± 0.21 


±5.00 ± 0.23 


±1.16 ± 0.34 


-0.06 ± 0.08 




30 


23.72 ± 0.38 


±6.04 ± 0.39 


±1.81 ± 0.17 


±0.03 ± 0.06 


Pepper 


20 


29.71 ±0.37 


±3.69 ±0.35 


±0.11 ±0.17 


±0.28 ±0.12 




25 


30.69 ±0.28 


±4.57 ±0.27 


±0.28 ±0.32 


±0.06 ±0.16 




30 


31.23 ±0.42 


±5.10 ±0.43 


±0.81 ±0.75 


-0.34 ±0.48 


Clock 


20 


23.27 ±0.18 


±5.36 ±0.19 


±1.49 ±0.22 


±0.11 ±0.10 




25 


24.29 ±0.22 


±6.35 ±0.22 


±1.77 ±0.26 


±0.09 ±0.08 




30 


25.49 ±0.50 


±7.53 ±0.51 


±2.49 ±0.19 


±0.32 ±0.13 


Text 


20 


24.67 ±0.26 


±5.83 ±0.28 


±1.57 ±0.20 


-0.10 ±0.04 




25 


25.87 ±0.30 


±7.00 ±0.33 


±1.98 ±0.33 


-0.03 ±0.14 




30 


27.26 ±0.65 


±8.37 ±0.67 


±3.03 ±0.42 


±0.18 ±0.06 



a joint MAP solution will have meaning. Instead of the L2- 
norm-based objective function of PSNR, the LI -norm (mean 
absolute error) -based PSNR is sometimes employed. In such 
cases, the median of the posterior distribution is generally 
the optimal estimator. In the case of the marginalized ML, or 
equivalently, type-II ML or empirical Bayes, for example, the 
registration parameters and other hyperparameters are firstly 
inferred as: 

[A,p, ^] = argmaxp(l^|A, p, /3, $). (45) 

If these parameters have priors, such a method is called 
marginalized MAP. The HR image and sometimes the edge 
information are then inferred to as MAP, 

X = argmaxmaxp(ic, r]\Y^ A, p, ^, (46) 
X 'n 

or PM. For such a two-step inference, it is difficult to calculate 
back the objective function. 

With regard to the Taylor approximation for the transforma- 
tion matrix W{cj)i), we used the first-order approximation in 
(l32l) because it is more stable than the second-order approx- 
imation. This first-order approximation was proposed by Vil- 
lena et al. L9J . The second-order approximation was proposed 
by Pickup et al. ©, and they obtained good results. We also 
tried the second-order approximation, but it sometimes made 
the algorithm unstable because it sometimes failed to produce 
a positive definite matrix for the covariance matrix Y^x- 

With regard to the Taylor approximation for In | A(?7, p, /^)| 
and In logistic(A), we introduced the first-order approximation 
around [ry, In p, In k] = [/ifj , In /i® , In /if] and In A = In /i® , 
respectively, in (1351) and (l36l) . Note that the Taylor expansion 
not with respect to p, A, but with respect to In p, In In A is 
our key idea to solve the conjugate prior problem. Indeed, we 
could successfully derive the terms originating from In | A| in 
update equations ((l52l). (l6Ql) . and (l62l) in Appendix). Kanemura 
et al O ignored the term of In | A| because of the high 
calculation cost, and this would result in less accurate infer- 
ence. As for ?7, we implicitly assumed that r] is not a binary 
vector but a continuous vector and did the differentiation. This 
assumption is based on ([T2l) . If we make another assumption 



- i.e., replacement of r]ij with r]f j in ([T2l) - ([T2l) has the same 
meaning, but the result of the Taylor approximation will differ 
from the current form. 

With regard to the experimental results, the proposed 
method outperforms the other methods in terms of the ISNR 
for most images and noise levels. Moreover, its estimation of 
the registration parameters was more accurate than the other 
methods were for most conditions. Therefore, we conclude the 
proposed method is on the whole superior to the other meth- 
ods. Compared with bilinear interpolation and Kanemura' s 
method, the superiority of the proposed method was clear. 
Compared with the Babacan 's method, the superiority of the 
proposed method was rather slight. Especially, in the case 
of the Pepper image in 30 dB noise, the porposed method 
was worse than the Babacan 's method. This inferiority is 
considered to be caused by unstable estimation of 7 and p, 
where Babacan 's method fixed the value of 7 to the true 
expected value in our implementation. Intuitively, the Pepper 
image is smoother than the other images and has fewer edges. 
Therefore, this feature is considered to be less preferable for 
complementary parameters of 7 and p. 

With regard to the calculation cost, the proposed algorithm 
requires 0{N^). This calculation cost order is given by two 
matrix inversions: in ([55l) and A in ([52l) and ([62l) (see 

Appendix). We found that a simple approximation such as 
considering all the off-diagonal elements to be zero reduces 
the calculation time but obviously degrades accuracy. We hope 
to solve this problem in our future work. 

VII. Conclusion 

In this paper, we proposed a Bayesian image super- 
resolution (SR) method with a causal Gaussian Markov ran- 
dom field (MRF) prior. We improved existing models with 
respect to three points: 1) the combined transformation model 
through a preferable normalization term using the elliptic 
theta function, 2) the causal Gaussian MRF model through 
introduction of a contrast parameter hz, which provides an 
effective normalizing constant including ln|A|, and 3) the 
hyperparameter prior model through application of a gamma 
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Lena Cameraman Pepper 
Images 

(a) Lambda 




Lena Cameraman Pepper Clock Text 
Images 




Lena Cameraman Pepper 
Images 

(b) Rho 




Lena Cameraman Pepper Clock Text 
Images 



(c) Kappa (d) Beta 

Fig. 5. Box and whisker plot of the PM for each hyperparameter, A, p, k, and /3, and image under SNR= 30dB noise 



TABLE II 

RMSES OF REGISTRATION PARAMETERS (A LOWER VALUE IS BETTER) 
FOR DIFFERENT SNR LEVELS 



parameter 


SNR 




RMSE 






[dB] 


(proposed) 


(Kanemura) 


(Babacan) 





20 


0.006 


0.006 


0.006 




25 


0.004 


0.004 


0.004 




30 


0.002 


0.003 


0.003 




20 


0.094 


0.095 


0.094 


25 


0.054 


0.059 


0.056 




30 


0.041 


0.060 


0.046 


[o\v 


20 


0.074 


0.073 


0.076 




25 


0.044 


0.052 


0.047 




30 


0.037 


0.044 


0.036 


7 


20 


0.031 


0.033 






25 


0.025 


0.030 






30 


0.028 


0.028 





distribution for the edge penalty parameter A, which prevents 



an unfavorable edge- strewn image. We then logically derived 
the optimal estimator, that is, not the joint maximum a 
posteriori (MAP) or marginalized maximum likelihood (ML) 
but the posterior mean (PM), from the objective function of the 
L2-norm (mean square error) -based peak signal-to-noise ratio 
(PSNR). The estimator is numerically determined by using 
variational Bayes. We solved the conjugate prior problem in 
variational Bayes by introducing three Taylor approximations. 
Other than these approximations, we did not use any approxi- 
mations such as ignoring the term In | A|. Experimental results 
showed that the proposed method is mostly superior to existing 
methods in accuracy. 

Appendix 

Here, we show the details of the variational Bayes' update 
equations in Section IV-C. 
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The mean values of the hyperparameters A, p, /t:, /3 over the 
trial distributions q^\X^ k,^ (3) are given by 



(t) (t) (t) ^(t) 

The update equation of rj is given as 



(47) 



OC 



where 



(ln|A(r/,p,^)|)^,)(^^,) , (48) 



+ (49) 

= or (j, j), 
= <^ -1, {k,l) = {iJ)oT{j,i), (50) 
0, otherwise. 

Using the Taylor approximation of (1351) . we obtain the distri- 
bution of (l37l) at step t -\- 1 with the parameter 



M<*t^) = logistic (^f + ^A^^C^t), 

where 

C©^.^tr[(A(/.©,M®,Mf)-^-C^l)M,, 
The update equation of x is given as 

ocexp|^-i|a.^A(4+i),M®,M®)^ 

^ ^^Il2)g(t)(0^) f • 



(51) 



(52) 



(53) 



It becomes a Gaussian distribution. Using the Taylor approx- 
imation (l32l) , we obtain the distribution of ([38l) at step t + 1 
with the parameters 



1^1 



-, -1 



/(t) 



(=1 



(54) 
(55) 



where 



C'S, ^ [wirwP + Y.I^I],MWSVW;%. (56) 



The update equation of tz^ f3 is given as 

q^^-^^\X, p, tz, f3) (X exp {^rip{z\Y)) 



(X 



&? + ^trCr)A(/xf),l,0)|p 
I 1=1 



\ (ln|A(r7,p, ^)|)g(t+i)(^) + iVr; In logistic (A) 



V 



(57) 



Using Taylor approximations ([32l) , ([35l) , and ([36l) , we obtain 
the distribution of ([39b at step t + 1 with parameters 



afi)=af + iV^Mf logistic(-Mf), 



(58) 
(59) 



,(*) 



am = af + ^trA(/x^+i),^©,^®)-iA(/xf ),1,0) (60) 
&r = &? + ^rCrU(/.^+i),l,0), (61) 



,(*) 



(62) 
(63) 

L 



5fi) = 6« + itrCr), 



(65) 



(=1 



fc,fc' 



The update equation of $ is given as 

--p(-^E{[^'-^Sns?r[0.-/^?j 

(66) 

Using the Taylor approximation ( l32l) . we obtain the distribu- 
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tion of (|40| at step t 



1 with parameters 

(0)i-l /O) , f^)r^//(t+l) 



where 



1 



■'/(t+i)' 
4>i 



//(t+l)n 



\k,k' 



(67) 
(68) 



(69) 
(70) 
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